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in  this  study  for  training  recurrent  perceptrons  can  be  used  to  train  perceptron  networks  that  have  radically  recurrent 
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Evolving  Recurrent  Perceptions 
for  Time-Series  Modeling 

J.  R.  McDonnell,  Member,  IEEE,  and  D.  Waagen 


Abstract — Evolutionary  programming,  a  systematic  maiti-agefit 
— ■ <*— *«>■  March  technique,  b  used  to  generate  recurrent  per- 
ceptroos  (nonlinear  DR  filters).  A  hybrid  optfanbatioa  scheme  b 

the  method  of  SoBs  and  Wets,  into  die  evolutionary  program¬ 
ming  paradigm.  The  proposed  hybrid  optimisation  approach 
b  farther  augmented  by  “blending”  randomly  selected  parent 
vectors  to  create  additional  offspring.  The  first  part  at  thb  work 
investigates  the  performance  at  the  suggested  hybrid  stochastic 
search  method.  After  demonstration  on  the  Bofaachevsky  and 
Roaenbrock  response  surfaces,  the  hybrid  stochastic  optimisation 
approach  b  applied  in  determining  both  the  model  order  and 
the  coefficients  of  recurrent  perception  time-aeries  models.  An 
information  criterion  b  used  to  evaluate  each  recurrent  percep- 
trou  structure  as  a  candidate  solution.  It  b  speculated  that  the 
stochastic  training  method  Implemented  in  thb  study  for  training 
recurrent  perceptions  can  be  used  to  train  perception  networks 
that  have  radically  recurrent  architectures. 

L  Introduction 

ARTIFICIAL  neural  networks  with  recurrent  connections 
represent  an  alternative  to  feedforward  networks  for 
nonlinear  models  of  time-series  data.  Feedback  connections 
allow  for  the  feedforward  information  to  be  distributed  back 
into  the  network,  and  may  result  in  increasingly  complex 
nonlinear  manifolds  with  an  increasing  order  of  recurrency. 
This  work  demonstrates  die  application  of  die  evolutionary 
search  method  in  “evolving"  simple  recurrent  perceptions  that 
may  serve  as  building  blocks  for  more  complicated  structures. 
Once  feasibility  is  demonstrated  for  simple  recurrent  percep¬ 
tion  structures,  the  evolutionary  search  method  can  then  be 
applied  to  highly  recurrent  perception  networks  with  complex 
architectures.  Stochastic  methods  are  an  attractive  training 
option  for  complicated  architectures  because  they  are  not 
constrained  to  a  specific  network  topology.  This  feature  allows 
both  the  network  structure  and  weights  to  be  determined  during 
the  training  process. 

Simultaneously  determining  both  perception  weights  and 
structure  requires  a  procedure  that  is  amenable  to  combina¬ 
torial  optimization  problems.  Successful  algorithms  for  these 
types  of  problems  have  generally  been  stochastic  search  tech¬ 
niques  such  as  simulated  annealing  [1],  genetic  algorithms 
[2],  and  evolutionary  programming  [3].  The  evolutionary 
programming  (EP)  paradigm  has  been  shown  to  have  the 
desired  attributes:  combinatorial  optimization  capabilities  [4], 
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fig.  1.  (a)  The  nonlinear  OR  filter  structure,  (b)  The  parallel  linear-nonlinear 
DR  filter  structure.  Evolutionary  programming  can  adapt  any  activation  func¬ 
tion.  and,  since  the  model  order  is  determined  during  the  search,  transversal 
structures  may  result. 

the  ability  to  determine  model  structure  [3],  and  the  ability  to 
train  neural  networks  [6], 

The  “perception”  in  this  study  refers  to  a  recursive  adaptive 
filter  with  an  arbitrary  output  function.  Fig.  1(a)  shows  the 
proposed  perception  structure  or  nonlinear  DR  filter.  Fig. 
1(b)  shows  a  linear-nonlinear  architecture,  although,  as  will 
be  seen  in  die  later  studies,  the  linear  activation  function 
could  be  replaced  by  one  that  is  nonlinear.  This  recurrent 
perception  model  is  inspired  by  the  structure  of  infinite  im¬ 
pulse  response  (UR)  filters  and  is  postulated  for  time-series 
modeling.  This  work  applies  an  evolutionary  or  systematic 
multi-agent  stochastic  search  to  determine  the  order  of  the 
recurrent  perception  structure  as  well  as  the  tapped -delay 
line  weights.  Modifications  to  the  perception  topology  are 
accomplished  by  either  increasing  or  decreasing  the  number 
of  tapped  delays  on  the  input  or  feedback  lines,  respectively. 
These  structural  modifications  are  limited  to  a  random  change 
of  plus  or  minus  one  tapped  delay  on  a  randomly  selected  line. 
Since  the  model  is  determined  during  training,  a  possibility 
exists  that  nonlinear  finite  impulse  response  (FIR)  perceptions 
will  result  should  die  feedback  order  become  zero. 

Williams  [7]  characterizes  recurrency  based  on  its  utiliza¬ 
tion  in  a  connectionist  architecture.  Conservative  recurrence 
corresponds  to  a  tapped -delay  input  signal  and  is  termed  a 
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transversal  filter  network.  This  approach  yields  a  network  that 
is  sensitive  to  temporal  patterns  without  directly  incorporating 
recurrent  units.  Transversal  filter  networks  have  been  widely 
applied  in  the  field  of  speech  recognition  (e.g.,  Waibel  et  al. 
[8]).  Liberal  recurrence  is  the  feedback  from  the  output  to  the 
input  units  and  corresponds  to  a  nonlinear  multi-input,  multi¬ 
output  (MIMO)  DR  filter.  Williams  assigns  the  term  recursive 
filter  to  transversal  filter  networks  with  tapped-delays  on  the 
output  line  feeding  back  as  inputs.  Radical  recurrence  refers 
to  recurrent  networks  that  model  systems  with  strongly  hidden 
states.  By  definition,  if  a  system  has  a  weakly  visible  state, 
it  can  be  modeled  with  either  the  transversal  or  recursive 
filter  networks.  The  radical  approach  allows  coupling  effects 
that  are  not  possible  with  die  more  traditional  transversal 
and  recursive  filter  approaches.  Both  structures  shown  in 
Fig.  1  have  liberal  recurrence  with  the  potential  to  reduce 
to  conservative  recurrence  during  the  evolutionary  training 
process. 

Feedforward  networks  have  been  successfully  used  for 
both  time-series  prediction  and  system  modeling,  generally 
using  tapped-delay,  or  transversal  filter,  network  structures. 
However,  these  networks  are  not  necessarily  the  typical  feed¬ 
forward  configuration  trained  solely  by  error  backpropaga- 
tion.  For  example,  a  Connectionist  Normalized  Linear  Spline 
Network  (CNLS)  has  been  formulated  by  die  Center  for 
Nonlinear  Studies  at  Los  Alamos  National  Laboratory  [9-1 1]. 
Pruning  connections  [12-13]  or  weight  sharing  [14]  can  im¬ 
prove  generalization  capabilities  as  well  as  increase  processing 
throughput,  since  architecture  size  is  reduced.  Using  only  the 
most  recent  observation,  Rao  and  Ramamurti  [IS]  generate  a 
radically  recurrent  network  based  upon  a  cascade-correlation 
[16]  approach. 

Saravanan  [17]  utilizes  a  purely  recurrent  structure  so  that 
next-step  estimates  are  only  a  function  of  previous  estimates. 
This  network  is  trained  using  evolutionary  search  methods 
Recurrent  neural  network  structures  have  also  been  success¬ 
fully  trained  using  EP  by  McDonnell  and  Waagen  [18]  and 
Angeline  et  al.  [19].  Other  types  of  recursive  structures  that 
have  been  evolved  include  finite  state  machines  [3]  and  the 
order  and  coefficients  of  ARMA  models  [S],  [20].  While 
“optimal  prediction  can  be  thought  of,  quite  simply,  in  terms  of 
optimal  filtering  in  absence  of  measurements”  [21],  practical 
applications  make  use  of  recent  observations.  Li  and  Haykin 
[22]  and  McDonnell  and  Waagen  [18]  utilize  both  a  window  of 
observations  and  a  window  of  previous  estimates  for  nonlinear 
time-series  prediction.  If  an  event  occurs  which  precludes 
making  an  observation,  then  substitution  of  the  estimate  for 
past  observations  may  suffice,  depending  on  the  accuracy  of 
the  model  and  noise  levels. 

The  combination  of  more  efficient  local  search  methods 
with  global  techniques  is  appealing.  As  Yao  [23]  states, 
“die  efficiency  of  evolutionary  training  can  be  improved  by 
incorporating  a  local  search  procedure  into  the  evolution.” 
However,  this  requirement  may  limit  the  applicability  of  the 
global  search  method  to  specific  types  of  architectures  since 
local  search  techniques  are  somewhat  restrictive.  To  alleviate 
this  concern  and  maintain  the  integrity  of  the  stochastic 
search,  only  direct  search  methods  are  considered  for  being 


embedded  into  EP.  This  rationale  was  successfully  used  in  the 
development  of  the  stochastic  direction  algorithm  by  Waagen 
et  al.  [24]. 

Before  discussing  the  optimization  of  recurrent  perceptions, 
Section  II  describes  and  demonstrates  a  hybrid  optimization 
approach  that  combines  the  Solis  and  Wets  random  optimiza¬ 
tion  technique  and  EP.  Variants  of  both  methods  are  applied 
to  finding  extrema  of  an  unknown  function.  A  hybrid  strategy 
is  subsequently  developed  that  embeds  the  Solis  and  Wets 
technique  within  EP.  Once  die  hybrid  approach  has  been 
successfully  demonstrated,  die  recurrent  perception  structure 
is  discussed  in  Section  DL  Results  are  then  given  in  Section 
IV  for  evolving  recurrent  perception  next-step  predictors  for 
a  variety  of  time-series  data. 

n.  Multi-Agent  Stochastic  Search 

A.  Benefits  of  Stochastic  Optimization 

As  a  direct  search  method,  stochastic  optimization  does  not 
require  derivatives  of  the  objective  function  nor  continuity 
of  the  response  surface  [25].  The  advantages  of  random 
search  methods  include  ease  of  implementation,  insensitivity 
to  die  type  of  criterion  function,  efficiency,  flexibility,  and 
die  generation  of  information  about  the  response  surface  [26], 
Efficiency  refers  to  the  allocation  of  resources  for  evaluating 
additional  points  on  die  response  surface  versus  deciding 
which  point  to  evaluate  next  Of  course,  this  can  be  detrimental 
if  the  criterion  function  requires  an  extensive  amount  of 
computation.  If  it  is  computationally  expensive  to  evaluate  the 
objective  function,  then  the  information  generated  during  the 
course  of  the  stochastic  search  can  be  used  to  direct  the  search 
procedure.  Pierre  [27]  stipulates  that  the  following  search 
evaluation  criteria  should  be  considered  before  selecting  any 
particular  optimization  strategy:  “1)  How  much  computational 
equipment  is  required?  2)  Has  the  search  technique  proved 
to  be  completely  successful  on  similar  types  of  performance 
measures?  3)  What  accuracy  is  required  of  the  search?  4)  What 
is  a  fair  measure  of  the  cost  of  die  search?  5)  How  will  the  time 
spent  in  evaluating  the  performance  measure  and  its  derivative, 
if  used,  compare  with  the  time  spent  on  other  aspects  of  the 
search?” 

In  response  to  these  issues,  some  generalizations  may  be 
made  with  respect  to  evolutionary  search  strategies.  1)  The 
computational  resources  must  provide  sufficient  memory  and 
processor  power  to  conduct  N  separate  searches,  since  evo¬ 
lutionary  methods  are  based  on  multi-agent  search  strate¬ 
gies.  Most  implementations  occur  on  serial  platforms  even 
though  multi-agent  search  strategies  are  inherently  paraileliz- 
able.  Considerable  computational  resources  may  be  required 
if  the  problem  has  an  extremely  high  dimensionality.  2)  As 
previously  discussed,  evolutionary  search  strategies  are  an 
excellent  means  to  solve  combinatorial  optimization  problems 
and  discover  globally  optimal  solutions.  3)  The  issue  of 
accuracy  has  ramifications  with  respect  to  a  priori  knowledge 
of  the  response  surface.  If  a  correct  model  structure  is  assumed, 
evolutionary  search  strategies  will,  in  general,  tend  to  be 
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slower  than  traditional  optimization  schemes.  This  results 
from  the  inefficiency  of  not  using  information  about  the 
gradient  (although  gradient  methods  can  be  incorporated  in 
parallel  with  the  evolutionary  search  strategy).  However,  die 
time  complexity  for  evolutionary  search  does  not  necessarily 
increase  dramatically  with  increased  dimensionality  [28]  or 
additional  constraints.  In  sum,  evolutionary  search  strategies 
are  robust  across  a  broad  spectrum  of  problem  domains.  4) 
The  number  of  function  evaluations  is  a  useful  metric  for 
comparing  evolutionary  search  strategies.  Time  complexity  or 
accuracy  may  serve  as  a  useful  metric  for  comparisons  with 
other  search  methods.  S)  The  matter  of  efficiency  is  discussed 
in  the  previous  paragraph. 

B.  Single-Agent  Stochastic  Search 

Random  optimization  has  traditionally  been  based  on  single¬ 
agent  stochastic  search  (SASS)  strategies.  Both  Rao  [25] 
and  Kamop  [26]  generate  a  random  walk  sequence  to  an 
extremum  by  perturbing  the  search  point  with  a  uniform 
random  variable.  Rao  also  exploits  the  directionality  of  the 
randomly  generated  vectors  that  continue  to  yield  lower  valued 
objective  functions.  In  an  algorithmic  formulation  similar  to 
that  of  Rao,  Matyas  utilizes  Gaussian  perturbations  about  the 
search  point  along  with  a  bias  term  to  direct  the  search  [29]. 
Solis  and  Wets  [30]  have  enhanced  this  approach  by  evaluating 
the  objective  function  at  z  -  6x  if  evaluation  at  x  +  Sx  does 
not  improve  the  current  value  of  the  objective  function  and  by 
incorporating  a  variable  perturbation  variance.  The  bias  and 
additional  function  evaluation  serve  as  stochastic  equivalents 
to  incorporating  momentum  and  gradient  information.  Baba 
has  successfully  applied  the  method  of  Solis  and  Wets  to 
training  feedforward  networks  to  predict  SO2  concentrations 
in  air  [31]. 

Algorithm  1  from  Solis  and  Wets  [30]  was  used  in  the 
studies  presented  here.  The  variance  of  the  perturbation  size 
(  is  controlled  by  the  repeated  number  of  successes,  sent, 
or  failures,  fent,  in  decreasing  the  objective  function  /.  The 
contraction  ct  and  expansion  ex  constants,  as  well  as  the  upper 
and  lower  bounds  on  standard  deviation  of  the  random  pertur¬ 
bations  a  are  set  by  the  user.  The  Algorithm  1  variant  gives 
the  basic  Solis  and  Wets  method  global  optimization  capability 
by  increasing  the  standard  deviation  of  the  perturbation  when 
it  falls  below  an  arbitrary  lower  bound  (see  step  2  below).  The 
formulation  is  described  as  follows 

1.  Initialize  the  search  vector  xo  and  bias  vector  bo  =  0. 
Set  k  =  0,  scni-0,fcnt-0.  Fix  ex,  ct.  Sent,  Fcnt.  out,on 
and  initialize  oa  =  1. 

2. 

ex  •  ok-i  if  sent  >  Sent 
Set  =  «  ift  fcnt>Fcnt 

,  at,- 1  otherwise 

3.  Generate  a  multivariate  Gaussian  random  vector  &  ~ 
N(bk,oI). 

4(a).  If  /(x*-K»)  <  /(xa).  then  set  xfc+1  =  xt+(t  and 
bfc+i  =  0.4&  +  0.2b*,  scnt=scnt+ 1 ,  fcnt=0. 


4(b).  Otherwise,  if  J(xk  -  &)  <  f(xk)  <  /(x*  +  (k), 
then  set  xk+l  =  xk  -  and  bfc+i  =  b*  -  0.4&, 
scnt=scnt+ 1 ,  fcnt-0 

4(c).  Otherwise „  xfc+i  =  x*  and  bk+\  =  OSbk./cn^/cn/ 
+1,  scnt^O. 

5.  If  k  =  maximum  number  of  iterations  then  stop,  else 
k  =  k  +  1  and  go  to  Step  2. 

The  coefficient  values  0.4  and  0.2  are  retained  from  Solis  and 
Wets'  results  [30].  Note  that  the  conditions  in  Step  2  are  not 
mutually  exclusive.  The  standard  deviation  o  specifies  the  size 
of  the  sphere  that  most  likely  contains  the  perturbation  vector, 
and  the  bias  term  6  locates  the  center  of  the  sphere  based 
on  directions  of  past  success.  Step  4(b)  implements  a  reversal 
strategy  seeking  a  better  solution  in  the  direction  opposite  to 
that  of  initial  perturbation. 

Optimization  experiments  were  conducted  to  find  the  point 
(xi,X2i)  which  minimizes  the  Bohachevsky  [32]  function 
/(x)  =  x\  +  2x\  -  0.3co6(3xxi)  -  0.4cos(4xx2)  +  0.7. 
The  standard  deviation  was  initialized  as  o0  =  1.0,  and  x 
was  initially  sampled  in  the  region  x  €  [-25, 25]2  for  all 
the  experiments  conducted.  The  transcendental  terms  generate 
many  local  minima  within  die  region  x  €  [-1,  l]2  while 
the  quadratic  terms  dominate  the  surface  structure  outside 
this  interval.  A  unique  global  minimum  exists  at  x  =  (0, 0). 
The  first  set  of  experiments  consisted  of  using  the  Solis  and 
Wets’  algorithm  outlined  above.  The  second  set  of  experiments 
employed  the  same  algorithm,  except  that  the  bias  term 
was  not  used.  A  third  set  of  experiments  employed  Gauss¬ 
ian  perturbations  having  a  standard  deviation  proportional 
to  the  magnitude  of  the  objective  function  such  that  £  ~ 
N(b,  /(x)I).  The  final  set  of  experiments  did  not  incorporate  a 
bias  term  so  that  £  ~  N( 0,  /(x)I).  The  variance  modification 
parameters  are  the  same  as  those  reported  in  [30]:  ex  = 
2,cf  =  0.5,  Sent  =  5,  and  Fcnt  =  3.  The  upper  and  lower 
bounds  on  the  standard  deviation  were  set  as  aub  =  1.0 
and  on  =  0.00001,  respectively.  The  average  results  of 
10  trials  are  shown  in  Fig.  2.  Based  on  these  experiments 
of  low  dimensionality,  it  appears  that  the  accuracy  of  the 
extremum  point  may  be  improved  significantly  if  the  standard 
deviation  of  the  random  perturbations  is  allowed  to  expand 
and  contract  independently  of  the  response  surface  height. 
Also,  roughly  an  order  of  magnitude  improvement  in  the 
cost  function  was  observed  using  a  random  perturbation  (  ~ 
N{ 0,  y//(x) I)  as  opposed  to  f  ~  Af(0,/(x)I).  Modification 
of  the  perturbation  size  remains  an  active  area  of  research,  as 
exemplified  by  work  in  evolution  strategies  [33]  and  meta-EP 
[34]. 

As  successful  as  the  basic  Solis  and  Wets  algorithm  ap¬ 
pears,  search  surfaces  may  be  encountered  for  which  global 
optimization  is  not  practical  if  0*1  is  continually  less  than 
some  critical  standard  deviation  that  guarantees  one  can  tunnel 
from  any  point  on  the  search  surface  to  an  extremum  with 
reasonable  likelihood.  To  reduce  the  occurrence  of  entrapment 
conditions  encountered  in  SASS  strategies,  it  is  suggested  that 
the  Solis  and  Wets  random  optimization  method  be  embedded 
in  a  multi-agent  stochastic  search  such  as  EP.  Even  if  0^  is 
not  constrained,  a  multi-agent  search  technique  will  provide  a 
more  rigorous  search  over  high-dimensional  spaces. 
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Hg.2.  A  corapwiioto  of  varanrs  of  (he  Solis  and  Wets  random  opcimtzMioa 
method  m  xppbed  to  fa  B<Mdgnty  function  end  wwfed  over  ten  Bill*. 
It  RppciRil  that  higher  eccaracy  is  Rtijnrrf  by  silowing  Ac  variance  of  the 
*— a**1  pertnrbsttooi  a  expend  end  contract  independent  of  the  response 
sarface  height 

C.  The  Evolutionary  Programming  Paradigm 

In  1958,  Brooks  [35]  described  a  creeping  random  method 
where  k  points  were  generated  via  Gaussian  perturbations 
about  a  search  point  The  best  point  was  kept  and  the  process 
repeated.  Brooks  observed  that  “there  are  some  rather  intrigu¬ 
ing  analogies  that  can  be  made  between  the  creeping  random 
method  and  evolution.”  This  analogy  was  also  apparent  to  Fo- 
gel  et  al.  [3]  who  proposed  a  population-based  random  search 
strategy  termed  evolutionary  programming  where,  instead  of 
keeping  the  single  best  point,  a  population  of  search  points  is 
maintained. 

EP  is  a  systematic  multi-agent  stochastic  search  (MASS) 
paradigm  that  can  be  used  for  finding  global  extrema  on 
response  surfaces.  Although  the  EP  methodology  simulates 
the  evolutionary  process  found  in  nature,  the  mechanisms 
incorporated  in  this  framework  and  resulting  characteristics 
may  also  be  found  in  some  of  the  stochastic  optimization 
techniques  previously  discussed.  Normally  distributed  pertur¬ 
bations  are  applied  to  the  jtk  element  in  the  solution 
vector  Xi  according  to  Sxij  ~  N(0,S/,ij  ■  Ji  +  /?,-,)  where 
S/,<j  is  the  scale  factor,  7,-  is  the  magnitude  of  the  objective 
or  criterion  function  corresponding  to  x,,  and  Pa  is  an  offset 
vector  (5).  The  scale  factor  can  be  considered  as  a  probabilistic 
analog  to  the  step-size  used  in  gradient  methods.  Similar  to 
the  hill-climbing  and  tunneling  ability  of  simulated  annealing 
relaxation  methods,  EP  employs  a  competition  process  that 
allows  less  fit  organisms  (search  points)  to  be  retained  in  the 
population  in  a  probabilistic  fashion.  The  competition  process 
is  viewed  as  a  competitive  annealing  mechanism.  An  EP 
optimization  algorithm  similar  to  that  in  [5]  is  given  below: 

1.  Form  an  initial  population  P  =  [xoX|X2 •••X2/v_ij 
of  site  2 N  by  randomly  initializing  each  n-dimensional 
solution  vector  Xj.  A  user-specified  search  domain  x,  € 

"“7  **  imposed. 

2.  Assign  a  cm  to  each  element  x\  in  the  population  based 
on  the  associated  objective  function  Jt  =  $(xi)*.t.$  : 
RP  -*  R. 


3.  Reorder  the  population  in  descending  order  based  on  the 
number  of  wins  generated  from  a  stochastic  competition 
process.  Wins  are  generated  by  randomly  selecting  other 
members  in  the  population  x,  and  incrementing  the  win 
counter  tu*  if  Ji  <  J,. 

4.  Generate  offspring  (x*  •  •  X2N-1)  from  the  N  highest 
ranked  elements  (xo  ■  •  -Xjv-i)  in  the  population  by  mod¬ 
ifying  each  element  Xij  €  x,  with  a  random  perturbation 
Sx^  ~  N(Q,Sf'ij  ■  Ji  +  0ij)  such  that 

ti+Wj  =  *ij  + 

5.  Loop  to  Sup  2. 

A  trajectory  of  the  best  population  member  at  each  generation 
during  a  search  on  the  Bohachevsky  surface  is  superimposed 
on  the  Bohachevsky  contours  as  shown  in  Fig.  3(a).  Fig.  3(b) 
shows  best  cost  in  the  population  at  each  generation  of  the  EP 
optimization  process.  Hie  search  is  stochastic,  so  it  is  expected 
this  trajectory  will  vary  in  every  trial. 

A  variety  of  other  techniques  may  be  employed  as  alterna¬ 
tives  to  the  methods  given  above.  For  example,  each  additional 
offspring  can  replace  the  least  fit  organism  in  the  population, 
as  is  done  by  [17]  and  [36].  In  a  parallel  implementation. 
Yip  and  Pao  [37]  generate  a  multitude  of  offspring  from  each 
parent  and  replace  the  parent  with  the  best  offspring  in  a 
probabilistic  manner  using  simulated  annealing.  When  the  off¬ 
spring  are  generated  with  structural  modification(s),  some  level 
of  parameter  optimization  should  occur  rapidly  to  reduce  the 
occurrence  of  discarding  good  structures.  One  approach  might 
be  to  allow  these  new,  higher-cost  members  of  the  population 
to  mature  by  modification  of  the  objective  function  according 
to  J'(x,k)  =  (l  -  e-(r  t+“))  J(x)  where  the  maturity  level 
k  of  a  population  member  could  be  determined  by  how 
many  generations  it  has  existed  within  the  population  and  the 
parameters  r  and  a  are  user-specified.  A  deterministic  means 
to  minimize  redundancy  might  also  be  employed  to  delay  the 
potential  dominance  of  a  single  member  in  the  population. 
After  all,  only  a  single  solution  is  required  and  convergence  of 
all  the  members  in  the  population  to  a  single  point  reduces  the 
effectiveness  of  the  search  process  in  exploring  other  portions 
of  the  search  space.  Retention  of  higher-cost  search  points  can 
be  done  probabilistically  by  setting  the  number  of  competitions 
to  an  arbitrarily  low  value,  thus  allowing  a  more  relaxed 
search.  As  the  number  of  competitions  increases,  the  retention 
of  the  more  fit  individuals  becomes  more  deterministic.  Fogel 
[34]  discusses  other  variants  of  the  EP  paradigm. 

The  EP  search  outlined  above  was  augmented  with  simple 
bisection  search  capabilities  by  averaging,  or  “blending,” 
randomly  chosen  vectors  according  to  x<>  =  0.5(x,  +  x_,) 
where  Xi  and  Xj  are  the  randomly  chosen  parent  vectors 
selected  from  (xo,  •  ■  • , x/v-t }  and  Xo  is  the  offspring  vector. 
This  was  done  for  half  foe  offspring  while  foe  other  half  were 
generated  using  the  perturbation  approach  6ii  ~  N(0,  J,  - 1) 
where  foe  covariance  matrix  is  an  identity  matrix  scaled  by 
the  value  of  foe  objective  function.  The  normal  perturbations 
complement  die  averaging  or  bisection  search  method  since 
it  is  unlikely  that  the  best  solution  point  exists  on  the  line 
between  two  search  points.  Likewise,  blending  elements  of  the 
population  complements  foe  random  walk  procedure  generated 
by  foe  evolutionary  search  if  it  is  assumed  that  the  population 
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fig.  3.  Global  optimization  via  evolutionary  programming,  (a)  The  trajectory  of  die  beat  member  in  die  population  at  each 
generation  it  auperimpoacd  on  die  contour  plot  of  the  Bohachevsky  function  with  Sf  *  1,  50  parents.  20  competitions,  (b)  The 
cott  of  the  best  member  in  the  population  at  each  generation.  Since  EP  is  a  stochastic  optimization  process,  the  trajectory  shown 
in  (a)  win  undoubtedly  vary  for  each  trial. 


points  are  distributed  about  an  extremum  point.  The  average 
results  for  10  optimization  trials  on  the  Bohachevsky  function 
are  shown  in  Fig.  4.  The  next  experiment  decoupled  the  cost 
function  from  the  perturbation  size  so  that  6x<  ~  N(0,1). 
By  decoupling  the  perturbation  variance  from  the  objective 
function,  roughly  an  order  of  magnitude  improvement  in 
the  best  member  of  generation  100  was  observed  as  shown 
in  Fig.  4.  Fogel  [5]  reports  requiring  an  average  of  65.5 
generations  over  20  trials  to  achieve  k>g10(/(x))  <  -6  using 
the  same  number  of  parents  (50)  and  offspring  (one  per  parent). 
By  decoupling  the  cost  function  and  implementing  a  simple 
bisection  search,  it  took  less  dun  30  generations,  as  averaged 
over  10  trials,  to  achieve  similar  results.  The  bisection  search 
will  not  provide  an  advantage  unless  the  global  optimum  is 
bounded  by  a  portion  of  the  population. 

D.  A  Hybrid  Approach 

Single-agent  stochastic  search  methods  can  be  easily  in¬ 
corporated  into  EP  without  sacrificing  the  integrity  of  the 
evolutionary  search  procedure.  A  variant  of  EP  is  proposed 
to  take  advantage  of  die  benefits  offered  by  Solis  and  Wets 


Generation 

Fig.  4.  Augmenting  the  EP  search  strategy  with  bisection  search  where  the 
offspring  are  generated  by  avenging  two  randomly  selected  parent  vectors 
Half  of  the  aShp^ng  were  generated  by  mutation,  the  other  half  by  avenging 
pairs  of  randomly  selected  parent  vectors.  The  bisection  search  is  useful  when 
combined  with  mnhi  sgmt  stochastic  methods  that  distribute  the  population 
about  a  global  extremum. 


and  blending  methods  discussed  in  the  preceding  sections.  It 
is  speculated  that  different  random  optimization  methods  will 
prevail  on  different  types  of  response  surfaces.  Based  on  its 
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Rf.  5.  One  gtowmion  o f  die  hyfcrid  mufti  egent  Mod— tic  March.  The  fim  tel  of  ottprim  is  generated  by  pertutbing  die  parent 
vectors  with  N(0,  a)  random  variables.  The  Handled  deviation  c»n  be.  tied  to  the  height  of  the  response  surface  so  that  0  =  5/ -cost. 
The  second  set  of  ofbpring  is  generated  by  an  avenging  or  Mending  (voces*.  The  third  set  of  offspring  is  generated  according  to 
the  Solis  and  Weis  algorithm  and  dctcnninis6caHy  replace  the  parents  of  a  lower  coat  is  achieved.  Note  that  the  variances  in  the 
offspring  are  f~—r»«~i  by  different  methods  and  are  not  die  same. 


multi-agent  search  capabilities,  EP  is  an  attractive  framework 
for  combining  a  variety  of  stochastic  search  procedures.  Al¬ 
though  the  previous  experiments  in  a  two-dimensional  search 
space  are  not  conclusive,  the  following  properties  appear 
potentially  beneficial  to  a  hybrid  approach:  1)  multi-agent 
search  and  variance  expansion  tend  to  avoid  local  minima, 
2)  information  garnered  during  die  search  process  about  the 
response  surface  can  be  used  to  direct  the  search,  and  3) 
convex  optimization  and  perturbation  variance  reduction  inde¬ 
pendent  of  the  response  surface  height  may  improve  accuracy. 
Making  the  perturbation  variance  proportional  to  the  value  of 
the  cost  function  may  not  always  yield  optimal  performance. 
Decoupling  the  perturbation  variance  from  the  cost  function 
value  may  prove  beneficial  since  the  shape  of  the  response 
surface  is  often  not  well  known  and  may  even  take  on 
negative  values.  A  similar  decoupling  strategy  was  employed 
by  Waagen  et  al.  [24].  If  the  height  of  the  response  surface 
is  known  a  priori,  then  the  offset  0  (see  Step  4  of  the  EP 
algorithm)  in  the  standard  deviation  of  the  perturbations  can 
be  incorporated.  Unfortunately,  knowledge  of  the  height  of 
the  response  surface  generally  corresponds  to  determining  the 
location  of  the  global  extrema.  If  cost  functions  for  which 
die  optimal  value  is  zero,  such  as  mean-squared  error,  are 
employed,  then  this  issue  is  less  significant. 

Figure  S  illustrates  a  hybrid  approach  that  employs  different 
methods  for  offspring  generation  within  EP.  While  parents  are 
selected  from  die  whole  population  in  the  usual  fashion  [3], 
the  manner  in  which  the  offspring  are  generated  is  variable. 
The  first  set  of  offspring  results  from  parent  search  points  that 
are  perturbed  by  a  random  vector  fix  ~  N(Q,<t  ■  I)  where 
o  can  be  fixed  [24],  proportional  to  the  corresponding  height 
of  the  response  surface  [3],  or  conditionally  based  on  search 
performance  [33].  The  second  set  of  offspring  results  from 
Mending  die  parameters  associated  with  a  pair  of  randomly 
chosen  parents.  This  may  be  as  simple  as  averaging  all  of  the 
elements  in  die  search  string.  If  the  model  structure  is  part  of 
the  search  vector,  then  both  die  first  and  second  set  of  offspring 
can  easily  accommodate  changes  in  the  model  structure  as 
well  as  die  weights.  The  final  set  of  offspring  is  generated 


using  the  method  of  Solis  and  Wets,  which  acts  on  the  existing 
parent  structures.  Each  offspring  in  this  third  set  will  replace 
its  parent  if  die  offspring  has  lower  objective  function  than  its 
predecessor.  The  type  of  convex  optimization  method  applied 
to  the  second  set  of  offspring  may  also  be  applied  to  other 
parameters,  such  as  the  bias  vector  b.  For  example,  if  the 
Solis  and  Wets  bias  term  is  included  in  die  search  string,  then 
die  first  set  of  offspring  is  instantiated  with  b  =  0,  while 
a  member  of  the  second  set  of  offspring  will  have  a  bias 
vector  determined  using  bo  =  0.5  *  (b(  +  b,-),  thereby  taking 
the  average  of  die  bias  vectors  from  two  randomly  selected 
parents.  The  other  Solis  and  Wets  parameters  are  instantiated  in 
a  similar  manner,  as  are  the  structural  parameters  (i.c.,  model 
order,  in  this  study).  Although  the  Solis  and  Wets  method 
could  be  repeatedly  applied  to  the  offspring,  and  has  been  done 
so  with  success,  it  is  speculated  that  the  different  offspring 
strategies  offer  a  more  robust  search  as  well  as  help  to  maintain 
diversity. 

The  average  cost  from  10  trials  using  the  hybrid  approach 
on  the  Bohachevsky  function  is  shown  in  Fig.  6.  The  hybrid 
technique  achieves  an  accuracy  of  10  orders  of  magnitude 
within  30  generations  (this  corresponds  to  a  maximum  of  7530 
function  evaluations).  In  order  to  ascertain  which  optimization 
procedure  was  being  utilized,  a  histogram  was  generated  from 
two  sample  optimization  runs  on  the  Bohachevsky  response 
surface  as  shown  in  Fig.  7.  These  runs  contained  10  parent 
vectors  and  two  sets  of  10  offspring  vectors  generated  via 
normal  perturbations  and  blending,  respectively.  One  run  was 
made  with  fixed  variance  perturbation  vectors  (  ~  N(0,I); 
the  second  trial  was  conducted  with  the  perturbation  variance 
proportional  to  the  cost  function  f,  ~  N(0,  J  •  I).  When 
N( 0,  J  •  I)  perturbations  were  incorporated,  the  perturbation 
technique  was  the  predominant  beneficial  search  mode.  When 
N( 0,1)  perturbations  were  used,  the  Solis  and  Wets  search 
technique  provided  most  of  the  optimization  capability.  The 
averaging  method  rarely  yielded  the  lowest  cost  member  in 
the  population. 

Before  dismissing  the  blending  approach  as  inadequate, 
some  comments  should  be  made  regarding  these  results.  It 
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Fig.  6.  Optimization  of  (be  Bohachevsicy  function  wing  die  hybrid  search 
stiwegy  with  50  parent  points.  The  second  set  of  offspring  has  a  fixed 
perturbation  variance  0  =  1. 


Fig.  7.  A  histogram  that  shows  the  frequency  of  the  occurrence  of  the 
generation  member  with  the  lowest  cost  for  the  Bohachevsky  response  surface 
The  Af(0, 1)  perturbations  are  generally  too  huge  for  the  small  global  well, 
and  optimization  occurs  primarily  via  the  Solis  and  Wets  technique.  The 
iV(O.cost)  perturbations  ate  small  enough  for  optimization  to  occur  within 
the  global  well.  The  blended  or  averaged  set  of  offspring  lately  occurs. 

is  expected  that  the  top  members  (say,  vectors  1-5)  of  the 
population  will  tend  to  be  the  best  if,  only  by  default,  better 
solutions  are  not  found.  The  larger  perturbations  will  generate 
points  outside  the  small  diameter  of  the  global  well  as  observed 
in  the  £  ~  N (0, 1)  section  of  the  histogram.  This  is  also  true 
in  some  respect  to  the  Solis  and  Wets  optimization  procedure, 
since  it  was  instantiated  with  a  unit  variance  and  discrete 
changes  occur  to  the  variance  based  on  die  performance  of  the 
search  vector.  When  the  Bohachevsky  function  was  artificially 
elevated  so  that  the  global  minimum  had  the  corresponding 
cost  /( 0,0)  =  10,  the  histograms  for  the  two  methods 
were  virtually  identical.  Since  subsequent  investigations  in 
evolving  perception  architecture  relied  heavily  on  the  blending 
approach,  the  conclusion  is  drawn  that  implementing  a  variety 
of  stochastic  methods  within  EP  provides  a  robust  approach 
for  the  optimization  of  problems  whose  response  surface  is 
not  well  known.  Incidentally,  both  of  the  tuns  that  generated 
the  histograms  in  Fig.  7  yielded  nearly  identical  levels  of  cost 
after  1000  generations. 

The  Rosen  brock  [38]  function  /(x)  =  100(zf  -  12)2  + 
(1  -  zi)3  was  chosen  for  comparing  the  hybrid  approach  to 
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Fig.  S.  Compsring  optimtzatioo  methods  wmg  Solis  and  Wets  sad  the  hybrid 
ipprouch  an  the  Rasenbrock  response  surface.  Fifty  parea  points  were  used 
in  the  hybrid  technique.  Each  generation  corresponds  to  a  maximum  of  ISO 
function  evaluations  for  each  method.  The  secood  set  of  offspring  in  the  hybrid 
approach  have  a  fixed  perturbation  variance  of  a  =  1. 

the  Solis  and  Wets  random  optimization  method.  This  function 
has  a  unique  global  minimum  at  x  =  (1, 1)  and  is  referred 
to  as  a  “banana  valley”  because  it  contains  a  steep  valley 
along  X2  =  xj.  Figure  8  shows  the  average  and  best  results 
from  10  trials  for  both  the  Solis  and  Wets  technique  and  the 
hybrid  approach  outlined  above.  In  an  effort  to  compare  the 
search  processes  based  on  an  equivalent  number  of  maximum 
function  evaluations,  each  generation  equals  a  maximum  of 
150  function  evaluations  for  the  Solis  and  Wets  method  and  a 
maximum  of  150  function  evaluations  for  the  hybrid  approach. 
Both  average  curves  show  optimization  was  still  occurring 
after  the  maximum  number  of  generations  or  iterations  had 
been  reached  and  the  experiment  was  arbitrarily  halted.  These 
results  compare  favorably  with  generic  EP  results  [5]  where 
it  is  reported  that  it  took  an  average  of  86  generations  (over 
20  trials)  to  achieve  an  accuracy  of  log10(/(x))  <  -4.  It 
should  be  noted  that  the  results  generated  from  the  Solis 
and  Wets  method,  by  itself,  are  also  comparable.  Now  that 
the  capabilities  of  the  hybrid  stochastic  search  have  been 
demonstrated  on  well-known  response  surfaces,  the  next  step 
is  to  determine  its  effectiveness  in  evolving  simple  recurrent 
perception  structures  similar  to  those  shown  in  Fig.  1 . 

III.  Evolving  Recurrent  Perceptrons 
A.  Motivation 

The  recursive  structures  shown  in  Fig.  1  are  referred  to  as 
recurrent  perceptrons  because  they  incorporate  nonlinearities 
on  the  output  of  a  recursive  linear  combiner.  The  recurrent 
perception  structure  is  inspired  by  recursive  adaptive  filters 
and  the  discrete  time  equation  that  models  linear  time-invariant 
(LTI)  system  dynamics 

m  —  1  n  — 1 

y(k  +  1)  =  5Z  °«x(*  “  *)  +  5Z  b'y(k  ~  *) 

«=o  i=0 

where  x  represents  the  input  to  the  system  and  y  is  the  system 
output  While  much  is  known  about  the  stability,  controllabil¬ 
ity,  and  observability  of  LTI  systems,  as  well  as  methods  for 
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generating  models  of  such  systems,  the  same  cannot  be  said 
for  nonlinear  systems  in  general  [39].  Nonlinear  versions  of 
the  identification  models  (motivated  by  LTI  dynamics)  given 
in  Narendra  and  Parthasarathy  [39]  are  described  by 

nonlinear  parallel  model: 

(m  — 1  n  —  1  \ 

53  *ix(k  -  *)  +  53  bi^k  ~  *) ) 

i=0  i=0  / 

nonlinear  series-parallel  model: 

(m—  1  n— 1  \ 

53  *ix(k  -  *) + 53  biv(k  -  *) ) 

i=0  i=0  / 

where  y  is  the  estimate  of  the  system  output.  The  nonlinear 
series-parallel  model  is  a  transversal  structure  that  utilizes 
the  actual  system  outputs.  Further,  its  linear  counterpart  is 
preferable  for  generating  stable  adaptive  laws  [39].  As  stability 
is  paramount  when  generating  recursive  filters,  stable  filters 
were  always  evolved  using  the  hybrid  stochastic  search  method 
implemented  in  this  investigation. 

The  recurrent  perception  structures  investigated  in  this  study 
are  characterized  by  the  following  difference  equations 

Class  l:  y(k  +  1)  =/(*(*),  x(k  -  1),  x(k  -  2),  •  •  • , 
x(k  —  m  +  1),  y{k),  y(k  —  1), 
y(k  —  2),  -  •  * ,  y(k  -  n  +  1)] 

Class  ll:  y(k  +  1)  =f[x(k),x(k  -  1  ),x(k  -  2), 

•  •  ■ ,  x(k  -  m  +  1),  y(k), y(k  -  1): 
y(k  —  2),  •  •  • ,  y{k  -  n  +  l)j 
+  y{i(fc),i(A:  -  1  ),x(k  —  2),  -  -  - , 
x{k  -  m  +  1),  y(k),  y(k  -  1), 
y(k  —  2),  •  •  • ,  y(k  -  n  +  l)j 

where  /  and  g  are  not  necessarily  the  same  mapping.  The 
Class  I  is  similar  to  the  Model  IV  discrete  time  plant  model 
given  by  Narendra  and  Parthasarathy  [39]  for  nonlinear  system 
identification  and  control,  except  that  the  nonlinear  transfor¬ 
mation  in  [39]  is  implemented  with  a  multi-layer  perceptron 
and  the  nonlinearity  in  this  paper  is  accomplished  using  a 
single  perception.  Class  II  is  similar  to  the  Model  III  discrete 
time  plant  model  given  in  [39],  if  the  evolutionary  search 
process  yields  an  /  that  is  dependent  only  on  previous  outputs 
[y{k),y(k  -  1),  -  -  •  ,y(fc  -  n  +  1)]  and  a  g  that  is  dependent 
only  on  past  inputs  [x(k),x(A:  -  1),  •  •  -  ,x(k  -  m  +  l)j.  The 
selection  of  these  models  in  [39]  was  “motivated  by  the  models 
that  have  been  used  in  the  adaptive  systems  literature  for 
the  identification  and  control  of  linear  systems  and  can  be 
considered  their  generalization  to  nonlinear  systems.” 

B.  Formulation 

The  recurrent  perception  model  structure  shown  in  Fig.  1(a) 
is  characterized  by  the  Class  I  difference  equation  and  can  be 
described  by 

(m-l  n—  1  \ 

53  aii{k  -i)  +  J2biy(k  -*)  +  <*] 

t=0  «=0  / 


where  the  search  strategy  must  determine  the  order  of  the 
feedforward  terms,  m,  the  order  of  the  feedback  terms,  n, 
as  well  as  the  feedforward  coefficients,  a,,  the  feedback 
coefficients,  b},  and  the  bias  a.  The  IIR  synapses  proposed 
by  Back  and  Tsoi  [40]  and  the  neurons  used  by  Li  and  Haykin 
[22]  have  the  same  structure  as  this  nonlinear  parallel  model. 

Recalling  that  polynomials  can  also  be  used  to  approximate 
any  static  mapping  /  :  Rm  — *  R”  to  an  arbitrary  degree  of 
accuracy,  and  that  the  sigmoid  function  can  be  expressed  as 
an  inverted  polynomial  series 

/(x)  =  (l+e-*)"1S=  + 

leads  to  the  suggestion  that  other  nonlinear  mappings  that  can 
also  be  expressed  by  polynomial  series  are  equally  applicable 
for  use  as  activation  functions  [41],  [42].  The  stochastic 
search  method  used  for  training  does  not  explicitly  incorporate 
knowledge  of  the  activation  function  (just  I/O  observations), 
so  any  activation  function  can  be  implemented  without  regard 
for  continuity  constraints.  By  virtue  of  their  smoothness, 
continuous  activation  functions  tend  to  possess  good  function 
approximation  properties.  The  search  may  even  be  conducted 
over  a  set  of  candidate  mapping  functions  F  such  that  /  6  F 
thereby  incorporating  the  selection  of  the  activation  function(s) 
in  the  evolutionary  optimization  process. 

The  objective  function  for  each  perceptron  is  similar  to 
Akaike’s  minimum  information  theoretical  criterion  (AIC) 
estimate  [43]  as  employed  by  Preistley  [44]  for  evaluating 
autroregressive  moving-average  (ARMA)  models 

AIC(m,  n)  =  Nln(de )  +  2(m  +  n  +  1) 

where  N  is  the  effective  number  of  observations.  An  additional 
factor  of  1  is  added  to  the  number  of  fitted  parameters  (m  +  n) 
to  account  for  the  bias  term  a.  The  MLE  of  the  innovation 
variance  [44]  is  determined  according  to 

1  N~' 
k—0 

where  the  observation  error  is  given  by  i(k)  =  y(k)  -  y(k). 
To  prevent  the  search  process  from  driving  the  number  of 
parameters  to  zero  and  stalling  at  a  large  MSE,  the  mod¬ 
ification  to  the  model  order  can  take  one  of  three  states 
(-1,0, +1).  Approximately  20%  of  the  time,  either  of  the 
conditions  (— 1,+1)  existed  for  a  randomly  selected  tapped- 
delay  line.  Thus,  if  there  are  four  tapped-delay  lines,  each  line 
is  being  modified  about  5%  of  the  time.  If  a  large  number  of 
tapped-delay  lines  are  employed,  then  the  percentage  of  time 
that  any  of  the  lines  are  affected  may  be  increased  to  maintain 
a  similar  modification  ratio. 

If  direct  linear  feedthrough  [43]  capabilities  are  desired  to 
be -present  in  parallel  with  the  nonlinear  contributions,  then 
the  perceptron  structure  can  be  reformulated  as  a  combination 
of  linear  and  nonlinear  recurrencies.  This  combined  structure 
corresponds  to  the  Class  II  model  where  g  is  a  linear  functional 
as  shown  in  Fig.  1(b)  and  is  expressed  by 

y(k  +  1)  =  yL(k  +  1)  +  yN(k  +  1) 
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Once  an  acceptable  model  has  been  detennined,  the  evo¬ 
lutionary  model  building  procedure  can  be  iteratively  applied 
to  the  residuals  as  discussed  by  PriesLey  [44].  Deterministic 
training  can  also  be  applied  to  the  evolved  model  in  an  effort 
to  “fine  tune"  the  model  coefficients.  Deterministic  training 
was  usually  not  applied  to  the  nonlinear  models  generated  in 
this  work,  either  during  or  after  the  training  process,  and  may 
potentially  have  yielded  slightly  better  results. 

C.  Deterministic  Training 

More  deterministic  methods  may  be  used  to  update  the 
perception  weights  while  the  model  structure  is  evolving,  or 
they  may  be  applied  to  the  results  of  the  stochastic  search  as  a 
post-processing  check  to  ensure  local  optimality.  The  recurrent 
perception  model  corresponding  to  Fig.  1(a)  can  be  described 
by 

yfc+i  =  /(wTz) 

where 

WT  =  [«0  Gl  •  •  '  Om-l  f>0  f>i  •  •  •  6n-l«] 

=  [x*  Xfc-1  •  •  •  Xfc-m+1  Vk  Vk-l  •  •  ■  Vk-n+l  1) 

If  the  objective  function  is  given  by  the  instantaneous  squared- 
error  Ek  —  e2  =  (zfc  -  y*)2  where  a*  is  the  desired 
output,  then  a  straightforward  gradient  approach  yields  the 
well-known  stochastic  approximation  weight  update  equation 


Fig.  9.  Evolutionary  optimization  of  the  pendulum  model.  The  lowest  AIC 
score  in  Che  population  is  shown  at  each  generation  of  the  evolutionary  search 
process. 

0.05.  A  linear-nonlinear  series-parallel  model  was  postulated 
according  to 

m— 1  n— 1 

+ 1)  =  £  -  o + £  wc*  -  o 

t=0  »=0 

(P-1  v-l  \ 

£  CiT{k  -  i)  +  £  -  *) ) 

t=0  t=0  / 

where  the  maximum  window  size  was  limited  to  five  samples. 
A  N(Q,  1)  random  forcing  function  was  used  to  generate 
50  000  samples,  5000  of  which  were  used  for  training  pur¬ 
poses.  The  search  population  consisted  of  10  parents,  each 
generating  a  single  offspring  using  EP  perturbations,  as  well 
as  another  set  of  10  offspring  by  averaging  randomly  chosen 
parents. 

The  optimization  process  took  place  over  500  generations  as 
shown  in  Fig.  9.  The  resulting  model  does  not  incorporate  the 
forcing  function,  but  instead  relies  only  on  past  observations 
of  the  pendulum  displacement  as  given  by 

9(k  +1)  =  O.87220(k)  +  O.78580(Jfc  -  1)  -  0Al278(k  -  2) 
-O.319O0(Jfc  -  3)  +  sin(O.O6870(Jfc)) 


*new  =  W  +  t?e*/'(wrz)z 

Tsoi  and  Back  [40]  have  derived  a  multi-layer  perception 
version  of  this  update  scheme  for  nonlinear  FIR  and  IIR 
perceptions.  Williams  and  Zipser  [46]  have  also  formulated 
a  batch  update  scheme  with  an  arbitrary  lag  window  for 
recurrent  multi-layer  perceptions. 

IV.  Prediction  Results 


Note  that  this  approximation  can  be  reduced  to  a  purely 
linear  system  because  of  the  small  coefficient  on  the  sin 
argument  The  simulated  system  is  almost  linear  by  virtue  of 
the  small  displacements  and  angular  velocities.  This  model  has 
a  MSE=1.54  •  10~5  for  the  training  data  shown  in  Fig.  10.  A 
test  set  was  generated  using  r(k)  =  0.5  cos(2t£/1000)  with 
the  resulting  MSE=3.69  •  10-6  on  the  testing  data  shown  in 
Fig.  11. 


A.  The  Simple  Pendulum 

Consider  the  equation  for  a  simple  pendulum  with  a 
velocity-squared  damping  term 

JO  +  B8\0\  +  K  sin  9  =  t 

where  J  =  1,  B  —  0.1,  and  K  —  1.  Fbr  simplicity,  the 
system  was  simulated  using  Euler  integration  with  a  steps  ize  of 


B.  The  Sunspot  Series 

The  second  set  of  experiments  was  conducted  on  Wolfs 
sunspot  series  for  the  years  1700-1988.  These  numbers  are 
indicative  of  the  average  relative  number  of  sunspots  observed 
each  day  of  the  year  and  serve  as  a  standard  benchmark  for 
time-series  modeling  [12],  [13],  [40]  where  the  objective  is 
to  generate  a  single-step  prediction  based  on  past  observa¬ 
tions.  Consistent  with  Weigend  et  al.  [12],  the  data  set  was 
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Fig.  10.  (»).  The  desired  output  used  in  the  training  set.  (b)  The  error  for 
each  point  in  the  training  time-series.  The  forcing  function  r  Af(0, 1)  was 
not  incorporated  in  the  resulting  model. 


(*) 


0>) 

Fig.  II.  (a).  Test  data  for  the  pendulum  model  generated  from 
r(t)  =  cos(2rk/1000.  (b).  The  error  for  the  test  time  series. 


partitioned  into  a  training  set  over  years  1700-1920  and  test 
sets  for  years  1921-1955  and  1956-1979,  respectively,  as 
the  latter  test  set  is  atypical  of  the  entire  time-series  [13]. 
Weigend  el  al.  use  12  inputs  to  8  hidden  units  in  a  12-8-1 
fully-connected  feedforward  architecture  where  the  number  of 
inputs  was  chosen  to  allow  direct  comparison  to  the  threshold 
autoregressive  (TAR)  model  of  Tong  and  Lim  (47).  Weigend 
el  al.  subsequently  reduce  the  12-8-1  network  to  a  12-3- 
1  network  by  weight-elimination.  Svarer  el  al.  [13]  employ 
the  Optimal  Brain  Damage  method  of  Le  Cun  el  al.  [48]  to 
generate  a  pruned  network  or  nonlinear  subset  model  with  5 
inputs  that  are  not  fully-connected  to  3  hidden  units  in  a  two 
layer  feedforward  network.  Priestley  [44]  describes  a  variety 
of  more  traditional  time-series  models  for  the  sunspot  data  set. 
These  include  autoregressive  (AR),  ARMA,  TAR,  and  bilinear 
models. 

Poor  results  were  usually  obtained  when  using  simple 
structures  like  that  shown  in  Fig.  1(a),  and  reasonable  results 
were  usually  found  using  the  Class  D  type  models  for  a 
variety  of  activation  functions.  Also,  nonrecurrent  models  were 
evolved  by  using  the  evolutionary  search  process  to  determine 
the  order  of  just  the  tapped -delay  input  lines.  The  maximum 
number  of  delays  was  arbitrarily  set  at  mmu  =  = 

p —  =  Qixulx  —  20  for  the  remaining  experiments  in  this 
work.  The  sunspot  data  set  was  normalized  by  a  factor  of  200. 
The  experiments  were  run  with  10  parents,  20  offspring  (10 


for  the  blending  process  and  10  utilizing  normal  perturbations 
where  the  standard  deviation  is  proportional  to  the  MSE  of 
the  network),  10  competitions,  an,  =  0.0001,  and  aub  =  1.0 
and  S/  =  1.  The  initial  order  of  the  tapped-delay  lines  was 
randomly  chosen. 

To  facilitate  comparison  with  previous  work  done  on  the 
sunspot  series,  the  average  relative  variance  was  determined 
for  the  evolved  model.  The  average  relative  variance  arv  is 
given  by  [12] 


arv  = 


Y,k{xk-xk)2 

T,k(xk-xk)2 


and  provides  a  normalized  mean  squared  error  (NMSE)  met¬ 
ric  for  comparing  the  performance  of  different  models.  The 
NMSE  is  independent  of  the  training  set  size  and  is  unity  in 
the  event  that  the  estimate  is  equivalent  to  the  mean  of  the  data, 
(t.e.,  x  =  x).  In  the  following  text,  the  am  set  {arvl,  arv2, 
arv3}  will  refer  to  the  average  relative  error  corresponding  to 
{training  set,  test  set  (1921-56),  test  set  (1957-1979)}. 

When  constrained  as  a  linear  system,  a  second-order 
transversal  filter  with  a  bias  term  as  given  by  yk+\  = 
1.2605xib  -  0.4915xk_i  -  0.1321xt_2  +  0.0831  was  evolved. 
That  is,  the  order  of  the  feedback  lines  became  zero.  The 
evolved  linear  model  has  an  arv  =  {0.1494,0.1732,0.4512}, 
which  compares  favorably  with  the  ninth-order  AR  model 
given  by  Priestley  with  an  arv  =  {0.1865,0.2235,0.4994}.  It 
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TABLE  I 

Waotrr  Set  pc*  the  Evolves  Transversal  Filter  Network 

Hidden  unit  1  Hidden  unit  2  Bias 

-0.9448 _  -1.1872 _ 05423 


Hid  until 

Hid  uni(2 

*4  *4-1 

—  1.5332  1.7664 

-1.1093 

1.3729 

*4-3 

0.3080 

**-« 

-0.2951 

**-5 

0.0252 

*4-4 

-0.2966 

*4-7 

-0.2740 

Bias 

1.1819 

-0.2191 

TABLE  II 

Weights  for  the  Evolved.  Recursive.  Filter  Network 

Hidden  unitl 

Hidden  unit  2 

Input  bias 

Output  bias 

Input  bias 

Output 

—0.6550 

13X68 

-0.2323 

0.3079 

-0.2323 

- 

*4 

X 

*4—2 

*4-3 

*4-< 

6 

Hid  mid 

-0.6512 

05742 

09500 

0.0830 

-0352 

- 

-0.0531 

0.0065 

Hid  unit2 

1.3000 

0.3824 

0.3040 

- 

* 

- 

*4-7 

ik 

ik-i 

Vk-7 

Vk-3 

bias 

Hid  unit  1 

—0.2252 

-0.1251 

0.5443 

0.3283 

-0.0391 

0.3393 

0.2483 

Hid  unit2 

- 

- 

- 

- 

- 

- 

0.1042 

is  interesting  to  note  that  die  subset  model  found  by  Svarer  et 
al.  also  includes  the  first  three  terms  {x(k),x(k—l),x(k—2)} 
as  well  as  {x(Jfc  —  7),x(fc  -  10)}.  (Neural  network  subset 
models  were  not  investigated  in  this  study  but  are  achievable 
using  EP  as  demonstrated  in  [18].)  From  this  model  structure, 
a  least-squares  estimate  can  be  found  by  forming  die  normal 
equations  from  y  =  ffw  or  equivalently 


'  V3 

V2  Vi  po  1 ' 

■wo" 

y* 

»3  Vi  V\  1 

Wi 

l 

:  :  :  : 

w2 

4/fc+l  - 

-Vt  yt-i  Vk-i  l. 

-W3. 

and  solving  for  the  weight  vector  w =  (HT H)~l HTy. 
If  this  is  done  using  the  complete  data  set,  a  pure  follower 
strategy  results,  since  w £,5  ss  [1000]T.  The  follower  strategy 
yields  an  arv  —  {0.2903,0.4268,0.9647}.  No  improvement 
was  found  when  a  gradient  search  scheme  was  applied  starting 
from  the  evolved  weight  coefficients.  This  yielded  a  slightly 
modified  weight  coefficient  vector  w  =  [1.2612  -  0.4899  - 
0.1248  0.0791]r  after  a  small  number  of  iteration 
A  transversal  filter  network  was  purposely  evolved  by  dis¬ 
allowing  feedback  of  the  previous  estimates  into  the  network. 
The  resulting  network  is  equivalent  to  a  single  hidden  layer 
network  with  two  hidden  units,  one  of  which  receives  eight 
inputs  and  one  of  which  receives  only  the  last  observation. 
The  weights  and  biases  for  this  network  are  given  in  Table 
I.  Each  node  utilizes  a  tanh  activation  function.  The  average 
relative  variance  for  this  network  is  given  in  Table  m  along 
with  arv  values  for  deterministically  trained  models.  The  better 
transversal  networks  that  are  known  [12],  [13]  have  three 
hidden  nodes  and  utilize  observations  from  an  eleventh-order 
lag  that  corresponds  to  .tire  average  period  of  the  data.  The 
best  model  is  ooly  partially  connected,  thereby  incorporating 
a  subset  of  the  actual  time-series  inputs. 


A  linear-nonlinear  model  incorporating  a  tanh  nonlinearity 
and  a  bias  term  was  evolved  as  given  by 

ik+i  —  0.5788yt  +  0.0176yt_i  —  0.1396yfc_2  +  0.0549y*_3 

-  0.0508yfc-4+  0.1030yfc_s  -  0.0239yfc_6 
+  0.01997yfc_7  -  0.4076yt  -  0.2614yfc_i 
+  0.0475  +  tanh(0.7109yt  -  0.0569yfc_i 

-  0.0375yt_2  -  0.0324yt_3  -  0.0532yt_4 

-  0.0630yt_5  -  0.0180yfc_6  +  0.0687yfc_7 
+  0.0964y„_8  +  0.0869yfc_s  +  0.0768yfc 
+  0.1740yit_i) 

This  single-step  sunspot  predictor  has  an  arv  =  {0.1260, 
0.1140,  0.3630}. 

Using  a  structure  similar  to  that  of  the  transversal  network, 
except  this  time  with  feedback  connections  to  the  hidden 
units,  a  recurrent  network  was  evolved  where  the  search 
determined  not  only  the  order  of  the  input  lines,  but  of 
the  feedback  lines  as  well.  During  the  training  process,  the 
number  of  tapped-delay  lines  on  one  feedback  loop  became 
zero,  thus  resulting  in  a  partially  feedforward  network  and 
partially  recurrent  network.  The  weights  for  this  structure 
are  given  in  Table  II.  Better  results  were  not  obtained  on 
the  sunspot  data  using  this  network  or  any  of  the  other 
evolved  recursive  filter  networks.  The  recurrent  networks’  arv 
values  are  given  in  Table  III,  where  the  first  20  observations 
have  been  substituted  for  the  estimated  values.  A  plot  of  the 
single-step  estimates  generated  from  the  recurrent  network  is 
shown  in  Fig.  12(a),  with  the  error  line  shown  in  Fig.  12(b). 
Although  better  transversal  networks  have  been  generated, 
it  is  still  suspected  that  recurrency  might  be  appropriate  for 
this  data  based  upon  the  results  discussed  by  Priestley.  Better 
results  might  be  obtained  in  evolving  both  the  transversal  and 
recurrent  structures  if  the  representation  (number  of  hidden 
nodes)  is  increased  and  the  complexity  penalty  for  the  number 
of  terms  is  relaxed. 
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TABLE  UI 

Comparison  of  Normalized  Error  Results  of  Previous  Work  on  the  Sunsfot  Data  Set  with  the  Solution  Found  Using  a  Recurrent  Structure 


Model 

Train  (1700-1920) 

Teat  (1921-1955) 

Teat  (1956-1979) 

Number  of  parameters 

Tbog  and  Urn  (44] 

0.097 

0.097 

0.28 

16 

Wttgmd  (Niter  al. )  (12] 

0.082 

0.086 

0.35 

43 

Svamr  (Nil  et  al.)  (13] 

0.090 

0.082 

0.35 

12-16 

Transversal  Net 

0.0987 

0.0971 

0.3724 

14 

Recurrent  net 

0.1006 

0.0972 

0.4361 

22 

TABLE  IV. 

Performance  of  Single-Step  Predictor*  on  tw  Logistic  Mae  for  100  Samples. 

Predictor 

*0 

AK 

MSE 

ISE 

arv 

ik+l  - 

aia(3.1476xa)  + 

0.0274 

02 

-316.7 

0.0004 

0.0004 

•- 

0.0032 

*A+i  =  «a(**4) 

02 

-213.7 

0.0012 

0.0013 

0.0091 

1-10-1  network 

02 

-117.1 

00017 

- 

0.0131 

1.15-1  network 

02 

-303.4 

0.0002 

- 

0.0015 

YEARS  lTOO-lME 
<b) 

Fig.  12.  (a)  The  evolved  recurrent  model  for  die  sunspot  data.  Training  was 
done  over  yean  1700-1920.  The  lest  set  consists  of  years  1921-1988.  The 
lint  10  data  training  points  were  not  included  in  the  model  evaluation  criteria 
(this  conespoods  lo  (he  masimum  model  order)  of  the  evolved  filler,  (b)  The 
error  trace  for  the  evolved  wcunent  single-step  sunspot  predictor.  The  data 
was  nonnaliaed  by  a  factor  of  200. 


C.  The  Logistic  Map 

Weigend  et  at.  [12]  use  the  iterated  quadratic  or  logistic  map 
Xk+i  =  4x*(l  -  Xk)  on  die  unit  interval  as  an  example  of  de¬ 
terministic  chaos.  It  cm  be  easily  shown  that  x*  =  sin* (2**0) 
is  a  solution  to  this  equation.  A  single-step  predictor  of  the 


form 

VN(k  + 1) 


(m— 1  n— 1.  \ 

53  -*)  +  51 6>V(*  “  *)  ) 

8=0  8=0  / 

+  sin  ( 53  Cii(k  - 1)  +  53  +  ° 

\«=o  i=o  / 


was  postulated.  After  50G0  generations,  die  evolutionary 
search  yielded  a  predictor  of  the  form  j/k+i  —  sin(3.1476xt)+ 
0.0274,  thereby  disregarding  the  cos  node  and  the  recurrent 
terms  (the  tapped -delay  orders  of  m,  n,  and  q  went  to  zero). 
Fig.  13  shows  the  results  of  the  evolved  solution  on  200 
points  generated  from  x0  =  0.2.  Upon  inspection  of  the 
evolved  solution,  it  was  observed  that  yjt+i  =  sin(rrxjt)  might 
serve  as  a  suitable  estimate.  Figure  14  shows  the  performance 
of  this  estimate  on  the  same  200  points  used  in  Fig.  13. 
Figure  13  gives  the  state  space  plot  for  each  of  the  estimates 
and  the  actual  quadratic  mapping  function.  Table  II  contains 
representative  AIC  and  MSE  values  for  a  100  point  sequence 
starting  with  the  given  initial  condition.  The  initial  error  at  xo 
was  neglected  in  these  calculations  because  no  observations 
have  been  made.  For  a  large  number  of  data  points,  the 
integral-squared  error  (ISE)  represents  a  closed  form  solution 
to  the  MSE  since 

MSE  =  —  53(®»+r  *»+t)  ~  y —  ®t+i)  Aii 

i=l  1=1 


In  the  limit,  the  MSE  approximation  becomes  the  ISE  that  is 
defined  on  the  unit  interval  (0,  1]  as 

ISE=  f  (xk+i  -  Xk+i)2dxt 
Jo 

which  is  also  reported  in  Table  IV. 

This  example  illustrates  the  potential  ability  of  the  search 
to  find  nonrecurrent  solutions  when  recurrent  solutions  are 
unnecessary.  Weigend  et  al.  report  using  three  radial-basis 
function  nodes  to  achieve  this  mapping  without  further  elab¬ 
oration.  Investigations  using  backpropagation  showed  dial  a 
better  solution  can  be  found  for  10  <  N  <  15  hidden  units  in 
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■  observations 
-  predictions 


■  observations 
predictions 


Fig.  13.  Logistic  map  test  data  and  the  evolved  solution  of  the  form 
xa+i  =  sin(3.1476xfc)  +  0.0274. 


Fig.  14.  Logistic  map  test  data  and  the  estimate  i* + 1  =  sin(xit)  generated 
from  inspection  of  the  evolved  solution. 


a  1-AM  feedforward  network.  The  AIC  values  given  for  these 
networks  in  Table  IV  were  determined  according  to 

A1C(NW)  =  N\n(&l)  +  2(NW) 

where  Nw  is  the  number  of  weights  and  biases  in  the  network. 


D.  The  Mackey-Glass  Equation 

The  Mackey-Glass  equation  represents  a  model  for  white 
blood  cell  production  in  leukemia  patients  [49].  This  model  is 
complicated  by  the  addition  of  a  time  delay  r  in  the  nonlinear 
differential  equation 


i(t)  = 


ax(t  -  t) 

1  +  xc(t  -  t) 


—  bx(t) 


observations 
evolved  solution 


*00 


where  the  free  parameters  are  selected  a  =  0.2,  6  =  0.1, 
c  =  10,  and  r  =  30  as  discussed  in  Jones  et  al.  [10].  The 
training  sets  consisted  of  500  data  points,  while  the  test  set  was 
comprised  of  the  subsequent  500  data  points.  This  data  was 
extracted  from  a  longer  run  so  that  the  initial  transients  would 
have  no  effect  The  data  were  normalized  by  a  factor  of  1 .4  and 
observations  were  made  every  second  that  corresponds  to  the 
simulation  stepsize.  These  experiments  incorporated  a  parallel 
sin-cos  nodal  arrangement  as  used  in  the  logistic  equation 
example.  After  5000  generations  of  training,  the  resulting 
4-5-2- 1  plus  bias  solution  has  the  form 

y*+i  =  cos(1.2881y*  +  0.7920y*_]  +  1.6431y*_2 

-  0.2588y,t_3-  0.7175yt_<  -  0.3239y* 

-  1.0519yfc_i  -  0.0572yfc_2  -  0.2l40yfc_3 

-  0.1310yt_4  —  1.1726y*_s)  +sin(2.9475y* 

+  0.8605yfc_i +0.0069yfc_2 

-  1.6800y*  -  1.0020yt_i)  -  0.9784 

Figure  16(a)  shows  the  results  of  this  model  on  the  training 
and  test  sets,  while  Fig.  16(b)  shows  the  error  on  the  same 
data  sets.  For  the  data  sample  shown  in  Fig.  16(a),  Table 
V  lists  the  AIC,  MSE,  and  arv  values  of  the  training  and 
test  sets  for  the  evolved  single-step  predictor.  The  last  450 
points  of  the  training  set  were  used  for  evaluation  purposes 


Fig.  15.  The  stale-space  plot  for  the  logistic  map.  Observations  generated 
from  x*+)  =  4x*(l  —  x* )  are  denoted  with  an  “0,"  the  evolved  solution 
estimate  it+\  —  sin(3.1476xt)  +  0.0274  is  denoted  by  an  "x."  and  the 
estimate  generated  by  inspection,  the  evolved  solution  ii  +  i  =  sin(xxt)  is 
denoted  by  a 

of  the  training  data  to  allow  the  transient  effects  due  to 
the  perceptron  recurrencies  to  die  out.  This  difference  in  the 
number  of  observations  alters  the  AIC  scores.  Nevertheless, 
the  arv  values  compare  favorably  with  the  results  found  in  the 
recurrent  network  training  evaluation  done  by  Logar  et  al.  [50] 
for  the  Mackey-Glass  equation  with  r  =  17.  The  preditor  can 
be  made  nonrecursive  by  letting  y  «—  y,  which  yields  a  MSE 
=  0.00088  and  an  arv  =  0.0232  on  the  test  set.  However,  if  a 
transversal  filter  is  desired,  it  is  suspected  that  better  results 
would  be  obtained  by  training  the  appropriate  structure.  This 
model  performs  poorly  if  forward  projections  are  generated 
by  replacing  the  observations  with  previous  estimates  so  that 
y  «—  y.  A  two-step  ahead  predictor  yields  a  MSE  =  0.0070  on 
the  test  set,  while  a  three-step  ahead  predictor  is  dramatically 
worse  with  a  MSE  =  5.1 125  on  the  test  set. 

v.  Conclusion 

This  work  has  incorporated  an  efficient  single-agent  search 
strategy,  the  method  of  Solis  and  Wets,  into  the  EP  frame¬ 
work  and  augmented  it  with  the  simplest  convex  optimization 
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•  observation 
-  prediction 


(•) 


Fig.  16.  (a).  The  evolved  solution  on  the  training  and  teat  data  generated 

from  the  Mackey-Glass  equation.  The  first  500  points  were  in  the  training 
set,  and  the  subsequent  500  points  were  used  for  the  testing  set.  Every  fifth 
observation  point  is  shown  by  a  solid  square,  (b).  The  enor  between  the 
evolved  solution  and  the  Mackey-Glass  dynamic  equations  for  the  time-series 
shown  in  Fig.  14. 


TABLE  V 

The  Performance  Results  for  the  Next-Step 
Ahead  Predictor  on  the  Mackey -Glass  Equation 


Data  set 

No.  of  effective 
observations 

AIC 

MSE 

an? 

Training 

450 

-4406.1 

0.00005 

0.0014 

Test 

500 

-4594.4 

0.00009 

0.0025 

capability,  bisection  search,  to  yield  a  hybrid  multi-agent 
stochastic  search  technique.  This  hybrid  method  can  enhance 
EP  optimization  efficiency  while  alleviating  local  minima 
problems  associated  with  single-agent  search  techniques.  A 
myriad  of  other  local  methods  could  also  have  been  easily 
incorporated  with  the  multiagent  EP  search  procedure. 

The  hybrid  method  was  applied  to  nonlinear  IIR  filters 
for  single-step  prediction  tasks.  Since  the  model  structure 
was  simultaneously  determined  along  with  the  weighting  co¬ 
efficients,  the  evolved  solutions  did  not  always  have  re¬ 
current  structure  (i.e„  transversal  filter  structures  sometimes 
resulted).  Good  single-step  prediction  ability  was  evolved 
using  nonlinear  mappings,  even  though  die  resulting  models 
were  dissimilar  to  the  models  (if  any)  that  created  the  original 
data.  The  learning  procedure  had  the  most  trouble  with  the 
noisy  sunspot  data,  indicating  that  the  representation  may  not 


be  sufficient  and/or  additional  work  is  warranted  for  systems 
with  noisy  output  Other  types  of  information-based  criteria, 
such  as  the  minimum  description  length  (MDL)  modeling 
principle  (51],  can  be  used  in  lieu  of  the  AIC  as  an  objective 
function,  and  may  yield  better  results,  since  the  AIC  is  not  a 
consistent  estimator  [52]. 

The  simple  structures  investigated  in  this  work  demonstrated 
a  reasonable  degree  of  proficiency  for  the  nonlinear  time- 
series  problems  studied.  Similar  AIC  values  were  attained  for 
a  varied  assortment  of  models  of  the  same  data  sets.  From 
these  results,  it  is  suspected  that  the  joint  parameter-function 
space  of  particular  data  sets  may  be  dense  in  the  number 
of  acceptable  solutions,  some  of  which  may  be  found  by 
the  relaxation  scheme  employed  in  this  investigation.  While 
it  is  evident  that  the  hybrid  stochastic  optimization  scheme 
provides  an  effective  learning  mechanism,  the  issue  of  what 
types  of  time-series  representations  can  be  effectively  modeled 
using  this  approach  has  not  been  addressed  in  this  investiga¬ 
tion.  By  cascading  and/or  parallelizing  recurrent  perceptions 
to  generate  multi-unit  networks,  as  in  traditional  feedforward 
architectures,  additional  capabilities  for  more  complex  time- 
series  processing  tasks  may  be  achieved. 
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